(feat): CUDA arr_wrappers — Zero-Alloc CuArray Reuse via setfield! #29

Merged
mgyoo86 merged 7 commits into master from feat/CuArray_wrappers
Mar 11, 2026

mgyoo86 commented Mar 11, 2026

Summary

Replaces the N-way view cache with arr_wrappers + setfield!-based CuArray reuse — the same zero-allocation strategy already used on CPU (Julia 1.11+). This eliminates the 4-way cache eviction limit: unlimited dimension patterns per N are now zero-alloc.

Key Changes

New: arr_wrappers Cache (types.jl, acquire.jl)

  • CuTypedPool gains arr_wrappers::Vector{Union{Nothing, Vector{Any}}} (indexed by dimensionality N, per-slot cached CuArray{T,N})
  • Cache hit: setfield!(wrapper, :dims, dims) — zero allocation
  • DataRef identity check (wrapper.data.rc !== vec.data.rc, ~2ns) minimizes refcount overhead — only updates :data when GPU buffer actually changed (rare grow-beyond-capacity case)
  • Removes CACHE_WAYS, views, view_dims, next_way fields and Preferences dependency
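
The cache-hit path described in the bullets above can be sketched on the CPU without CUDA. This is only an illustration under stated assumptions: `MockWrapper`, `MockPool`, and `get_wrapper!` are hypothetical stand-ins, not the extension's actual types or functions; a plain `Vector` plays the role of the GPU buffer, and object identity on it stands in for the DataRef check.

```julia
# Hypothetical stand-in for CuArray{T,N}: a mutable wrapper with :data and :dims.
mutable struct MockWrapper{T,N}
    data::Vector{T}
    dims::NTuple{N,Int}
end

# Hypothetical stand-in for CuTypedPool: backing vectors plus the wrapper cache,
# indexed first by dimensionality N and then by slot.
struct MockPool{T}
    vectors::Vector{Vector{T}}
    arr_wrappers::Vector{Union{Nothing, Vector{Any}}}
end

MockPool{T}() where {T} = MockPool{T}(Vector{T}[], Union{Nothing, Vector{Any}}[])

function get_wrapper!(pool::MockPool{T}, slot::Int, dims::NTuple{N,Int}) where {T,N}
    vec = pool.vectors[slot]
    # Grow the outer table so that index N exists.
    while length(pool.arr_wrappers) < N
        push!(pool.arr_wrappers, nothing)
    end
    per_n = pool.arr_wrappers[N]
    if per_n === nothing
        per_n = Any[]
        pool.arr_wrappers[N] = per_n
    end
    # Grow the per-N table so that index `slot` exists.
    while length(per_n) < slot
        push!(per_n, nothing)
    end
    cached = per_n[slot]
    if cached !== nothing
        wrapper = cached::MockWrapper{T,N}
        setfield!(wrapper, :dims, dims)      # cache hit: only the dims change
        if wrapper.data !== vec              # identity check; rarely true
            setfield!(wrapper, :data, vec)
        end
        return wrapper
    end
    wrapper = MockWrapper{T,N}(vec, dims)    # cache miss: allocate once per (N, slot)
    per_n[slot] = wrapper
    return wrapper
end

pool = MockPool{Float32}()
push!(pool.vectors, zeros(Float32, 64))
a = get_wrapper!(pool, 1, (8, 8))
b = get_wrapper!(pool, 1, (4, 16))   # same N, different dims: same wrapper object
c = get_wrapper!(pool, 1, (64,))     # different N: separate cached wrapper
```

Because the cache is keyed by `(N, slot)` rather than by dims, any number of same-N dimension patterns reuse one wrapper, which is the property the summary attributes to the new design.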

New: _resize_to_fit! (acquire.jl)

  • Capacity-aware CuVector resize: setfield!(:dims) when within maxsize, delegates to resize! only when beyond capacity
  • Superset of old _resize_without_shrink! — also optimizes grow-within-capacity (critical for re-acquire after safety invalidation)
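
A rough CPU-only sketch of the capacity-aware resize idea follows. `MockCuVector` and `resize_to_fit_sketch!` are hypothetical names, and capacity is tracked in elements purely for simplicity; the real `_resize_to_fit!` operates on a CuVector and may differ in detail.

```julia
# Hypothetical stand-in for a CuVector: `storage` plays the GPU allocation,
# `maxsize` its capacity (in elements here, an assumption for simplicity).
mutable struct MockCuVector
    storage::Vector{Float32}
    maxsize::Int
    dims::Tuple{Int}
    reallocs::Int   # counts simulated reallocations
end

MockCuVector(n::Int) = MockCuVector(zeros(Float32, n), n, (n,), 0)

function resize_to_fit_sketch!(v::MockCuVector, n::Int)
    n == v.dims[1] && return v        # no-op hot path
    if n <= v.maxsize
        setfield!(v, :dims, (n,))     # shrink, or grow within capacity
    else
        resize!(v.storage, n)         # beyond capacity: real reallocation
        v.maxsize = n
        v.dims = (n,)
        v.reallocs += 1
    end
    return v
end

v = MockCuVector(100)
resize_to_fit_sketch!(v, 0)     # e.g. after safety invalidation
resize_to_fit_sketch!(v, 80)    # re-acquire: grows back without reallocating
resize_to_fit_sketch!(v, 200)   # exceeds capacity: one reallocation
```

The middle call is the grow-within-capacity case the bullet calls critical: the logical length is restored by changing only `dims`.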

New: Direct _acquire_impl! Dispatch (acquire.jl)

  • CUDA overrides for _acquire_impl! / _unsafe_acquire_impl! route directly to get_array! (no get_view! → get_array! indirection)
  • get_view! retained for backward compat only
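
The routing change can be illustrated with plain multiple dispatch. All names below (`GenericMockPool`, `GpuMockPool`, `acquire_sketch!`, `CALLS`) are hypothetical; the sketch only shows how a more specific method can bypass an intermediate call, not how the package defines its methods.

```julia
abstract type AbstractMockPool end
struct GenericMockPool <: AbstractMockPool end
struct GpuMockPool <: AbstractMockPool end

const CALLS = String[]   # records which functions ran, for illustration only

get_array_sketch!(::AbstractMockPool, dims) =
    (push!(CALLS, "get_array!"); zeros(Float32, dims))
get_view_sketch!(p::AbstractMockPool, dims) =
    (push!(CALLS, "get_view!"); get_array_sketch!(p, dims))

# Older routing: acquire goes through the view function first.
acquire_sketch!(p::AbstractMockPool, dims::Int...) = get_view_sketch!(p, dims)

# Override for the GPU-like pool: go straight to the array function.
acquire_sketch!(p::GpuMockPool, dims::Int...) = get_array_sketch!(p, dims)

empty!(CALLS)
acquire_sketch!(GpuMockPool(), 2, 3)
direct_path = copy(CALLS)     # only the array function ran

empty!(CALLS)
acquire_sketch!(GenericMockPool(), 2, 3)
indirect_path = copy(CALLS)   # view function, then array function
```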

New: _reshape_impl! for CuArray (acquire.jl)

  • Same-dim: in-place setfield!(:dims), no pool interaction
  • Cross-dim: claims slot, reuses cached CuArray{T,N} wrapper
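
The two reshape cases can be sketched as follows. `ReshapeMock` and `reshape_sketch!` are hypothetical, and a `Dict` keyed by dimensionality replaces the slot-based cache for brevity, so this is an approximation of the idea rather than the PR's `_reshape_impl!`.

```julia
# Hypothetical stand-in for CuArray{T,N}.
mutable struct ReshapeMock{T,N}
    data::Vector{T}
    dims::NTuple{N,Int}
end

function reshape_sketch!(cache::Dict{Int,Any}, w::ReshapeMock{T,N},
                         dims::NTuple{M,Int}) where {T,N,M}
    prod(dims) == prod(w.dims) ||
        throw(DimensionMismatch("element count must not change"))
    if M == N
        # Same dimensionality: mutate the wrapper in place, no cache lookup.
        setfield!(w, :dims, dims)
        return w
    end
    # Different dimensionality: reuse (or create once) the wrapper cached for M.
    cached = get(cache, M, nothing)
    if cached === nothing
        out = ReshapeMock{T,M}(w.data, dims)
        cache[M] = out
        return out
    end
    out = cached::ReshapeMock{T,M}
    setfield!(out, :dims, dims)
    out.data !== w.data && setfield!(out, :data, w.data)
    return out
end

cache = Dict{Int,Any}()
m = ReshapeMock{Float32,2}(zeros(Float32, 24), (4, 6))
same = reshape_sketch!(cache, m, (6, 4))       # same N: returns `m` itself
flat = reshape_sketch!(cache, m, (24,))        # cross-dim: new 1D wrapper, cached
flat_again = reshape_sketch!(cache, m, (24,))  # cross-dim again: cached wrapper reused
```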

Fix: Safety Invalidation (debug.jl, state.jl)

  • _invalidate_released_slots! for CuTypedPool: poison fill + _resize_to_fit!(vec, 0) + arr_wrappers dims zeroing
  • Backing vector length correctly restored in _cuda_claim_slot! after invalidation
  • _zero_dims_tuple(N): literal tuples for N≤4, avoids ntuple dynamic-dispatch allocation
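
A helper in the spirit of `_zero_dims_tuple` might look like the sketch below. The name `zero_dims_tuple_sketch` and the `ntuple` fallback for N > 4 are assumptions; the PR text only states that literal tuples are used for N≤4.

```julia
# Return an all-zero dims tuple of length N.
# Literal tuples cover the common dimensionalities; the fallback for larger N
# is an assumption of this sketch.
function zero_dims_tuple_sketch(N::Int)
    N == 1 && return (0,)
    N == 2 && return (0, 0)
    N == 3 && return (0, 0, 0)
    N == 4 && return (0, 0, 0, 0)
    return ntuple(_ -> 0, N)
end

zero_dims_tuple_sketch(3)   # (0, 0, 0)
```

During invalidation such a tuple would be written into a cached wrapper's `:dims` field, so stale wrappers appear empty rather than pointing at released data.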

Test Coverage

  • Same-N unlimited patterns (4, 5+ patterns) — GPU & CPU zero-alloc
  • Mixed-N (1D + 2D + 3D) — single and multi-slot
  • reshape! zero-alloc (cross-dim, same-dim, mixed sequence)
  • _resize_to_fit! unit tests (shrink, grow-back, after-invalidation, beyond-capacity)
  • Safety invalidation (arr_wrappers dims zeroed, backing vector restored)
  • acquire! and unsafe_acquire! both paths
  • Loop patterns (100 iterations)

Breaking Changes

None. Public API is unchanged. Internal CuTypedPool struct fields changed (N-way cache fields removed, arr_wrappers added).

mgyoo86 added 4 commits March 11, 2026 10:27
Replace N-way round-robin cache with arr_wrappers[N][slot] pattern
(CPU parity). Key changes:

- CuTypedPool: views/view_dims/next_way → arr_wrappers
- _resize_to_fit!: capacity-aware resize (superset of _resize_without_shrink!)
- _cuda_claim_slot!: maxsize-based capacity check avoids spurious GPU realloc
- get_array!: DataRef identity check (cu.data.rc !== vec.data.rc) for
  zero-overhead common path, refcount update only on rare grow-beyond-capacity
- _reshape_impl! for CuArray: same-N setfield!, different-N cached wrapper
- Safety invalidation updated for arr_wrappers + _resize_to_fit!
- Remove CACHE_WAYS constant and Preferences dependency
…afety invalidation

_cuda_claim_slot! only checked maxsize-based capacity but didn't restore
the logical length of backing vectors after safety invalidation (which
sets dims to (0,)). This caused escape detection (Level 3) to fail
because vectors appeared empty during overlap checks.

Replace manual capacity check with _resize_to_fit! which handles all
cases: capacity growth, length restoration, and no-op hot path.

CUDA _acquire_impl! and _unsafe_acquire_impl! now route directly to
get_array!, eliminating the get_view! → get_array! indirection. On CUDA,
view/array distinction is meaningless (both return CuArray), so all
acquire paths converge to the same arr_wrappers-based get_array!.

get_view! stubs kept for backward compat (direct callers) but no longer
on the main acquire path.

Add 8 Mixed-N pattern tests (1D+2D+3D) verifying zero-alloc for both
GPU (CUDA.@allocated) and CPU (@allocated) across same-slot different-N,
multi-slot mixed-N, varying dims, and unsafe_acquire! variants.

- Add _zero_dims_tuple(N) helper: literal tuples for N≤4 avoid
  ntuple(_ -> 0, N) dynamic-dispatch allocation on safety invalidation
- Apply to TypedPool, BitTypedPool (CPU), CuTypedPool, legacy BitTypedPool
- Fix _unsafe_acquire_impl! NTuple overload: delegate to Vararg (matches
  _acquire_impl! and CPU pattern)
- Add CUDA reshape! zero-alloc tests (cross-dim, same-dim, mixed, correctness)
mgyoo86 requested a review from Copilot March 11, 2026 19:45
Copilot AI left a comment

Pull request overview

This PR updates the CUDA backend to reuse CuArray{T,N} wrappers via an arr_wrappers cache and setfield!, removing the prior fixed-size N-way cache and aiming for zero CPU allocations across unlimited same-N dimension patterns.

Changes:

  • Replace CUDA’s N-way view cache with arr_wrappers-based wrapper reuse (setfield! on :dims, plus DataRef update on rare buffer changes).
  • Introduce _resize_to_fit! and capacity-based slot claiming to avoid unnecessary GPU reallocations (especially after safety invalidation).
  • Extend/refresh CUDA tests to validate zero-alloc behavior for same-N, mixed-N, and reshape!, plus update safety invalidation expectations.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ext/AdaptiveArrayPoolsCUDAExt/types.jl | Updates CuTypedPool struct to add arr_wrappers and remove N-way cache fields. |
| ext/AdaptiveArrayPoolsCUDAExt/acquire.jl | Implements _resize_to_fit!, capacity-based slot claim, arr_wrappers lookup/store, and CUDA _reshape_impl!. |
| ext/AdaptiveArrayPoolsCUDAExt/debug.jl | Updates CUDA safety invalidation to shrink via _resize_to_fit! and invalidate arr_wrappers by zeroing dims. |
| ext/AdaptiveArrayPoolsCUDAExt/state.jl | Updates empty! to clear arr_wrappers instead of N-way cache vectors. |
| ext/AdaptiveArrayPoolsCUDAExt/AdaptiveArrayPoolsCUDAExt.jl | Removes Preferences-based CACHE_WAYS config and reframes extension as arr_wrappers-based. |
| src/state.jl | Adds _zero_dims_tuple helper and uses it during wrapper invalidation. |
| src/legacy/state.jl | Updates legacy BitArray invalidation to call _zero_dims_tuple. |
| test/cuda/test_nway_cache.jl | Renames/expands tests from N-way cache to arr_wrappers, adds mixed-N + reshape! coverage. |
| test/cuda/test_extension.jl | Updates struct-field assertions to expect :arr_wrappers. |
| test/cuda/test_cuda_safety.jl | Updates safety invalidation assertions to check arr_wrappers dims are zeroed. |
| test/cuda/test_allocation.jl | Updates resize tests to _resize_to_fit! and adds grow-within-capacity-after-invalidation coverage. |

codecov bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 53.33333% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.54%. Comparing base (616373a) to head (68ebaab).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/state.jl | 50.00% | 4 Missing ⚠️ |
| src/legacy/state.jl | 57.14% | 3 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (53.33%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.
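
The reported percentages can be cross-checked with a few lines of arithmetic. The per-file patch totals (8 and 7 lines) are not stated in the report; they are inferred here from the missing-line counts and percentages, so treat them as an assumption. The project figure uses the hits and lines from the head column of the coverage diff.

```julia
# (missed lines, total patch lines); totals are inferred, not taken from the report.
patch = Dict(
    "src/state.jl"        => (missed = 4, total = 8),
    "src/legacy/state.jl" => (missed = 3, total = 7),
)

coverage_pct(missed, total) = 100 * (total - missed) / total

total_missed = sum(v.missed for v in values(patch))   # 7 lines missing in total
total_lines  = sum(v.total for v in values(patch))    # 15 patch lines in total

patch_pct   = round(coverage_pct(total_missed, total_lines); digits = 2)
project_pct = round(100 * 2541 / 2632; digits = 2)    # hits / lines at head
```

Under these inferred totals, 8 of 15 patch lines are covered, which matches the 53.33% patch figure, and 2541 of 2632 lines matches the 96.54% project figure.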

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #29      +/-   ##
==========================================
- Coverage   96.79%   96.54%   -0.26%     
==========================================
  Files          14       14              
  Lines        2620     2632      +12     
==========================================
+ Hits         2536     2541       +5     
- Misses         84       91       +7     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/legacy/state.jl | 96.71% <57.14%> (-1.05%) ⬇️ |
| src/state.jl | 96.32% <50.00%> (-1.43%) ⬇️ |

@mgyoo86 mgyoo86 merged commit b4fa5a3 into master Mar 11, 2026
10 of 11 checks passed
2 participants